
Exploration of semantic-based taxonomies

The following notebook explores the use of semantic embeddings to build taxonomies of entities, with the goal of creating an ontology of the ARIA dataset. It uses embeddings of the tags identified in articles extracted from OpenAlex to cluster entities into groups. The taxonomy is assumed to be hierarchical.

Multiple options are explored to create the taxonomy, including:

  • Strict hierarchical clustering: Iteratively cluster the entity embeddings using Agglomerative Clustering and KMeans. At each level, clustering is performed on the subsets that were created at the previous level. This allows for strict hierarchical entity breakdowns.
  • Strict hierarchical clustering with varying number of clusters: Iteratively cluster the entity embeddings using KMeans. At each level, clustering is performed on the subsets that were created at the previous level. The number of clusters is proportional to the parent cluster size.
  • Fuzzy hierarchical clustering: Cluster the entity embeddings using an array of methods, including KMeans, Agglomerative Clustering, DBSCAN, OPTICS, and HDBSCAN. Several levels of resolution are used, and the taxonomy is built by concatenating these. This allows for non-strict hierarchical entity breakdowns.
  • Hierarchical clustering using a dendrogram climb algorithm: Reconstruct the dendrogram of the Agglomerative Clustering, and use it to create the taxonomy.
  • Hierarchical clustering using the centroids of parent-level clusters: Cluster the entity embeddings using KMeans, and use the centroids of the clusters as nodes in subsequent levels of the taxonomy.
  • Meta clustering using the co-occurrence of terms in the set of clustering results: Use outputs from all previous clustering methods to create a tag co-occurrence matrix. Apply community detection algorithms to this matrix to create the taxonomy.

The utils for this notebook include all functions needed to create the taxonomy:

  • ClusteringRoutine: class that performs any of the clustering methods described above.
  • run_clustering_generators: runs all clustering methods and returns the results in a dictionary.
  • make_dataframe: creates a dataframe with the results of the clustering methods.
  • make_plots: creates a series of plots to visualize the results of the clustering methods.
  • make_cooccurrences: creates a co-occurrence matrix of the clustering results.
  • make_subplot_embeddings: creates a series of subplots with the embeddings of the entities in the taxonomy.

[TODO] Pipeline should simply create the necessary dataframes and save them to S3.
[TODO] Plots & validation / silhouettes should be in subfolder of the pipeline, called "validation" or "evaluation".

In [ ]:
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
from IPython.display import display
import boto3, pickle, json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import umap.umap_ as umap
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from hdbscan import HDBSCAN
from itertools import product
from toolz import pipe
from collections import defaultdict
from itertools import chain
from functools import partial
from dap_aria_mapping import PROJECT_DIR, BUCKET_NAME, logger
from dap_aria_mapping.pipeline.semantic_taxonomy.utils import (
    make_subplot_embeddings,
    make_dataframe,
    make_plots,
    make_cooccurrences,
    run_clustering_generators
)

np.random.seed(42)

Obtain entities from OpenAlex

The entity tags are obtained from OpenAlex. Filtering is applied to remove entities that are too frequent or too infrequent. The entities are then embedded using the SPECTER model. In addition, low-dimensional representations of the embeddings are obtained using UMAP for plotting purposes; the cell below requests 8 components (n_components) and plots the first two.
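The frequency filter mentioned above is not shown in this notebook. As a rough illustration only, the sketch below keeps tags by document frequency; the function name filter_by_document_frequency and both thresholds (min_docs, max_share) are hypothetical and not taken from the project's pipeline.

```python
from collections import Counter


def filter_by_document_frequency(doc_tags, min_docs=2, max_share=0.5):
    """Keep tags found in at least `min_docs` documents and in at most a
    `max_share` fraction of all documents (both thresholds are assumptions)."""
    n_docs = len(doc_tags)
    # Count each tag at most once per document
    doc_freq = Counter(tag for tags in doc_tags for tag in set(tags))
    return {
        tag
        for tag, count in doc_freq.items()
        if count >= min_docs and count / n_docs <= max_share
    }


# Toy corpus: one list of tags per article
docs = [
    ["Influenza", "Science", "Vaccine"],
    ["Influenza", "Science"],
    ["Ammonia", "Science"],
    ["Ranitidine", "Science", "Vaccine"],
]
kept = filter_by_document_frequency(docs)
print(sorted(kept))  # ['Influenza', 'Vaccine']
```

Here "Science" is dropped for appearing in every document, while "Ammonia" and "Ranitidine" are dropped for appearing only once.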

In [ ]:
s3 = boto3.client("s3")
try:
    try:
        logger.info("Downloading embeddings from S3")
        embeddings_object = s3.get_object(
            Bucket=BUCKET_NAME,
            Key="outputs/embeddings/embeddings.pkl"
        )
        embeddings = pickle.loads(embeddings_object["Body"].read())
    except Exception:  # S3 unavailable or object missing: fall back to a local copy
        logger.info("Failed to download from S3. Attempting to load from local instead")
        with open(f"{PROJECT_DIR}/outputs/embeddings.pkl", "rb") as f:
            embeddings = pickle.load(f)
except Exception:  # no local copy either: regenerate the embeddings
    logger.info("Failed to load embeddings. Running pipeline with default (test) parameters")
    import subprocess
    subprocess.run(
        f"python {PROJECT_DIR}/dap_aria_mapping/pipeline/embeddings/local_export.py", 
        shell=True
    )
    with open(f"{PROJECT_DIR}/outputs/embeddings.pkl", "rb") as f:
        embeddings = pickle.load(f)

embeddings = pd.DataFrame.from_dict(embeddings).T
embeddings = embeddings.iloc[:500_000]

# UMAP projection; with n_components=8, only the first two components are plotted below
params = [
    ["n_neighbors", [20]],
    ["min_dist", [0.1]],
    ["n_components", [8]],
]

keys, permuts = ([x[0] for x in params], list(product(*[x[1] for x in params])))
param_perms = [{k: v for k, v in zip(keys, perm)} for perm in permuts]

for perm in param_perms:
    embeddings_2d = umap.UMAP(**perm).fit_transform(embeddings)
    fig = plt.figure(figsize=(10, 10))
    plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=1)
    fig.suptitle(f"{perm}")
    plt.show()
2023-01-03 00:49:46,696 - botocore.credentials - INFO - Found credentials in shared credentials file: ~/.aws/credentials
2023-01-03 00:49:46,806 - dap_aria_mapping - INFO - Downloading embeddings from S3
2023-01-03 00:49:47,965 - dap_aria_mapping - INFO - Failed to download from S3. Attempting to load from local instead

Strictly hierarchical clustering

The following clustering routine iteratively clusters the entity embeddings using Agglomerative Clustering and KMeans. At each level, clustering is performed on the subsets that were created at the previous level.
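The idea of clustering within the subsets of the previous level can be sketched as follows. This is a minimal illustration on random data, not the ClusteringRoutine used by the notebook; the function name nested_kmeans and the label format ("2_0_1") are assumptions made for the example.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans


def nested_kmeans(X, n_clusters_per_level, seed=42):
    """Return one list of labels per level. Each label is the parent label
    extended with the child cluster index, e.g. "2_0_1"."""
    labels = [""] * len(X)
    levels = []
    for k in n_clusters_per_level:
        # Group row indices by the label assigned at the previous level
        groups = defaultdict(list)
        for i, label in enumerate(labels):
            groups[label].append(i)
        new_labels = list(labels)
        for parent, idx in groups.items():
            # A subset cannot be split into more clusters than it has points
            k_eff = min(k, len(idx))
            child = KMeans(n_clusters=k_eff, n_init=5, random_state=seed).fit_predict(X[idx])
            for i, c in zip(idx, child):
                new_labels[i] = f"{parent}_{c}" if parent else str(c)
        labels = new_labels
        levels.append(labels)
    return levels


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))  # stand-in for the entity embeddings
levels = nested_kmeans(X, [3, 3, 2])
print(levels[0][:3], levels[1][:3], levels[2][:3])
```

Because every child label starts with its parent label, the result is a strict hierarchy by construction.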

In [ ]:
cluster_configs = [
    [
        KMeans,
        [
            {"n_clusters": 5, "n_init": 5},  # parent level
            {"n_clusters": 5, "n_init": 5}, # nested level 1
            {"n_clusters": 5, "n_init": 5}, # nested level 2
            {"n_clusters": 10, "n_init": 5}  # nested level 3
        ],
    ],
    [
        AgglomerativeClustering,
        [
            {"n_clusters": 5}, # parent level
            {"n_clusters": 5}, # nested level 1
            {"n_clusters": 5}, # nested level 2
            {"n_clusters": 10}  # nested level 3
        ],
    ],
]
# run clustering generators
cluster_outputs_s, plot_dicts = run_clustering_generators(cluster_configs, embeddings)
In [ ]:
# plot results
fig, axis = plt.subplots(2, 4, figsize=(32, 16), dpi=200)
for idx, (cdict, cluster) in enumerate(plot_dicts):
    _, lvl = divmod(idx, 4)
    make_subplot_embeddings(
        embeddings=embeddings_2d,
        clabels=[int(e) for e in cdict.values()],
        axis=axis.flat[idx],
        label=f"{cluster[-1]} {str(lvl)}",
        s=4,
    )
fig.savefig(PROJECT_DIR / "outputs" / "figures" / "semantic_taxonomy" / "clustering_strict.png")

# print silhouettes
for output in cluster_outputs_s:
    print(
        "Silhouette score - {} clusters - {}: {}".format(
            output["model"][-1].__module__,
            output["model"][-1].get_params()["n_clusters"],
            output["silhouette"],
        )
    )
Silhouette score - sklearn.cluster._kmeans clusters - 5: [0.019219099]
Silhouette score - sklearn.cluster._kmeans clusters - 5: [0.019219099, 0.012733343]
Silhouette score - sklearn.cluster._kmeans clusters - 5: [0.019219099, 0.012733343, 0.011935249]
Silhouette score - sklearn.cluster._kmeans clusters - 10: [0.019219099, 0.012733343, 0.011935249, 0.00047204847]
Silhouette score - sklearn.cluster._agglomerative clusters - 5: [0.0070309527]
Silhouette score - sklearn.cluster._agglomerative clusters - 5: [0.0070309527, -6.916772e-05]
Silhouette score - sklearn.cluster._agglomerative clusters - 5: [0.0070309527, -6.916772e-05, 0.0060779215]
Silhouette score - sklearn.cluster._agglomerative clusters - 10: [0.0070309527, -6.916772e-05, 0.0060779215, 0.04126006]

Strict hierarchical clustering with imbalanced nested clusters

The following clustering routine iteratively clusters the entity embeddings using KMeans. At each level, clustering is performed on the subsets that were created at the previous level. The number of clusters at each level is allowed to vary, being determined by the size of the parent cluster.
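One way to let the number of child clusters depend on the parent size is to split a level-wide budget proportionally. The sketch below assumes that rule and a floor of one cluster per parent; the function name allocate_clusters is hypothetical, and the exact rule used by run_clustering_generators with imbalanced=True may differ.

```python
def allocate_clusters(parent_sizes, total_clusters):
    """Split `total_clusters` across parents in proportion to their size,
    giving every parent at least one cluster (assumed rule)."""
    total = sum(parent_sizes)
    return [max(1, round(total_clusters * size / total)) for size in parent_sizes]


print(allocate_clusters([500, 300, 150, 50], 20))  # [10, 6, 3, 1]
print(allocate_clusters([990, 5, 5], 20))          # [20, 1, 1]
```

With the floor of one cluster per parent, the level total can exceed the requested budget, as in the second call.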

In [ ]:
cluster_configs = [
    [
        KMeans,
        [
            {"n_clusters": 5, "n_init": 5},  # parent level
            {"n_clusters": 20, "n_init": 5},  # nested level 1, total n_clusters is 20+
            {"n_clusters": 20, "n_init": 5},  # nested level 2, total n_clusters is 20+
            {"n_clusters": 40, "n_init": 5},  # nested level 3, total n_clusters is 40+
        ],
    ],
]
# run clustering generators with imbalanced nested clusters
cluster_outputs_simb, plot_dicts = run_clustering_generators(cluster_configs, embeddings, imbalanced=True)
In [ ]:
# plot results
fig, axis = plt.subplots(1, 4, figsize=(32, 8), dpi=200)
for idx, (cdict, cluster) in enumerate(plot_dicts):
    labels = [int(e) for e in cdict.values()]
    di = dict(zip(sorted(set(labels)), range(len(set(labels)))))
    labels = [di[label] for label in labels]
    _, lvl = divmod(idx, 4)
    make_subplot_embeddings(
        embeddings=embeddings_2d,
        clabels=labels,
        axis=axis.flat[idx],
        label=f"{cluster[-1]} {str(lvl)}",
        s=4,
    )
fig.savefig(PROJECT_DIR / "outputs" / "figures" / "semantic_taxonomy" / "clustering_strict_imb.png")

# print silhouettes
for output in cluster_outputs_simb:
    print(
        "Silhouette score - {} clusters - {}: {}".format(
            output["model"][-1].__module__,
            output["model"][-1].get_params()["n_clusters"],
            output["silhouette"],
        )
    )
Silhouette score - sklearn.cluster._kmeans clusters - 5: [0.0192222]
Silhouette score - sklearn.cluster._kmeans clusters - 4: [0.0192222, 0.007234139]
Silhouette score - sklearn.cluster._kmeans clusters - 2: [0.0192222, 0.007234139, 0.005260979]
Silhouette score - sklearn.cluster._kmeans clusters - 3: [0.0192222, 0.007234139, 0.005260979, 0.0042145625]

Fuzzy hierarchical clustering

The following approach iteratively clusters the entity embeddings using any sklearn method that supports the predict_proba method. This approach has no notion of level: finer-grained clusterings are agnostic about the output of coarser ones. Passing lists of parameter values produces one output for each element of the Cartesian product of those values within a clustering method.
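The expansion of parameter lists into a Cartesian product can be sketched with itertools.product. The helper name expand_param_grid is hypothetical, and treating scalar values as single-element lists is an assumption about how the configurations are interpreted.

```python
from itertools import product


def expand_param_grid(params):
    """Expand a dict whose values may be lists into one dict per combination.
    Scalar values are treated as single-element lists (assumption)."""
    keys = list(params)
    value_lists = [v if isinstance(v, list) else [v] for v in params.values()]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


dbscan_grid = expand_param_grid({"eps": [0.15, 0.25], "min_samples": [8, 16]})
kmeans_grid = expand_param_grid({"n_clusters": [5, 5, 5, 10], "n_init": 5})
for combo in dbscan_grid:
    print(combo)
```

The four DBSCAN combinations come out in the same order as the DBSCAN rows of the silhouette output in this section.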

In [ ]:
cluster_configs = [
    [KMeans, [{"n_clusters": [5, 5, 5, 10], "n_init": 5}]], # level 1, level 2, level 3, level 4
    [AgglomerativeClustering, [{"n_clusters": [5, 5, 5, 10]}]], # level 1, level 2, level 3, level 4
    [DBSCAN, [{"eps": [0.15, 0.25], "min_samples": [8, 16]}]], # 2 x 2 parameter grid: four outputs
    [HDBSCAN, [{"min_cluster_size": [4, 8], "min_samples": [8, 16]}]], # 2 x 2 parameter grid: four outputs
]

# run clustering generators with fuzzy clusters
cluster_outputs_f_, plot_dicts = run_clustering_generators(cluster_configs, embeddings)
In [ ]:
# plot results
fig, axis = plt.subplots(4, 4, figsize=(40, 40), dpi=200)
for idx, (cdict, cluster) in enumerate(plot_dicts):
    make_subplot_embeddings(
        embeddings=embeddings_2d,
        clabels=[int(e) for e in cdict.values()],
        axis=axis.flat[idx],
        label=f"{cluster[-1].__module__}",
        cmap="gist_ncar",
    )
fig.savefig(PROJECT_DIR / "outputs" / "figures" / "semantic_taxonomy" / "clustering_fuzzy.png")

# print silhouettes
for cluster in cluster_outputs_f_:
    print(
        "Silhouette score - {}: {}".format(
            cluster["model"][-1], cluster["silhouette"]
        )
    )
Silhouette score - KMeans(n_clusters=5, n_init=5): [0.019146662]
Silhouette score - KMeans(n_clusters=5, n_init=5): [0.019120798]
Silhouette score - KMeans(n_clusters=5, n_init=5): [0.019562474]
Silhouette score - KMeans(n_clusters=10, n_init=5): [0.013820635]
Silhouette score - AgglomerativeClustering(n_clusters=5): [0.0070309527]
Silhouette score - AgglomerativeClustering(n_clusters=5): [0.0070309527]
Silhouette score - AgglomerativeClustering(n_clusters=5): [0.0070309527]
Silhouette score - AgglomerativeClustering(n_clusters=10): [-0.014145394]
Silhouette score - DBSCAN(eps=0.15, min_samples=8): [0]
Silhouette score - DBSCAN(eps=0.15, min_samples=16): [0]
Silhouette score - DBSCAN(eps=0.25, min_samples=8): [0]
Silhouette score - DBSCAN(eps=0.25, min_samples=16): [0]
Silhouette score - HDBSCAN(min_cluster_size=4, min_samples=8): [0.01672557]
Silhouette score - HDBSCAN(min_cluster_size=4, min_samples=16): [0.02083284]
Silhouette score - HDBSCAN(min_cluster_size=8, min_samples=8): [-0.02618753]
Silhouette score - HDBSCAN(min_cluster_size=8, min_samples=16): [-0.17779437]

Using dendrograms from Agglomerative Clustering (enforces hierarchy)

This approach uses a single run of any sklearn clustering method that exposes a children_ attribute. The children_ attribute is used to recreate the dendrogram that produced the clustering, which is then used to create the taxonomy. The climbing algorithm advances one union of subtrees at a time. The number of levels is determined by the dendrogram_levels parameter.
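As an illustration of how nested levels can be read off children_, the sketch below replays the recorded merges one at a time and stops when the requested number of clusters remains. It is one possible reading of the idea, not the climbing algorithm implemented in the notebook's utils; cut_dendrogram is a hypothetical helper and the data are random.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def cut_dendrogram(children, n_samples, n_clusters):
    """Replay the first n_samples - n_clusters merges and label each sample
    by the subtree it ends up in."""
    parent = list(range(2 * n_samples - 1))  # leaves followed by internal nodes

    def find(node):
        while parent[node] != node:
            node = parent[node]
        return node

    for step, (left, right) in enumerate(children[: n_samples - n_clusters]):
        new_node = n_samples + step  # sklearn numbers merged nodes this way
        parent[find(left)] = new_node
        parent[find(right)] = new_node

    roots = [find(i) for i in range(n_samples)]
    _, labels = np.unique(roots, return_inverse=True)
    return labels


rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))  # stand-in for the entity embeddings
# distance_threshold=0 with n_clusters=None makes sklearn build the full tree
model = AgglomerativeClustering(n_clusters=None, distance_threshold=0).fit(X)
levels = {k: cut_dendrogram(model.children_, len(X), k) for k in (2, 4, 8)}
print({k: len(set(v.tolist())) for k, v in levels.items()})
```

Since coarser cuts only add merges on top of finer ones, every cluster at one level is contained in exactly one cluster of the level above.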

In [ ]:
cluster_configs = [[AgglomerativeClustering, [{"n_clusters": 100}]]]

# run clustering generators with dendrograms
cluster_outputs_d, plot_dicts = run_clustering_generators(cluster_configs, embeddings, dendrogram_levels=6)
In [ ]:
# plot results
fig, axis = plt.subplots(2, 3, figsize=(24, 16), dpi=200)
for i, ax in zip(range(6), axis.flat):
    make_subplot_embeddings(
        embeddings=embeddings_2d,
        clabels=[int(e[i]) for e in cluster_outputs_d["labels"].values()],
        axis=ax,
        label=f"dendrogram - level {i}",
        s=4,
    )
fig.savefig(PROJECT_DIR / "outputs" / "figures" / "semantic_taxonomy" / "clustering_dendrogram.png")

Centroids of KMeans clustering as children nodes for further clustering (à la job skill taxonomy)

This approach chains any number of KMeans runs. After the first level, the centroids produced by one level are used as the data points for the next, so each run clusters the clusters of the previous one.
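A two-level version of this idea can be sketched directly with sklearn. The cluster counts and the random data are placeholders, and the variable names are chosen for the example rather than taken from the utils.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))  # stand-in for the entity embeddings

# Finest level: cluster the points themselves
fine = KMeans(n_clusters=20, n_init=5, random_state=42).fit(X)
# Coarser level: cluster the 20 centroids of the previous level
coarse = KMeans(n_clusters=5, n_init=5, random_state=42).fit(fine.cluster_centers_)

fine_labels = fine.labels_
# Each point inherits the group assigned to its centroid
coarse_labels = coarse.labels_[fine_labels]
# Number of points behind each centroid, e.g. for sizing markers in a plot
sizes = np.bincount(fine_labels, minlength=20)
print(coarse_labels[:10], sizes.sum())
```

Each fine cluster maps to exactly one coarse cluster, so the hierarchy is strict, but the coarse run only sees centroids and ignores how many points each one represents.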

In [ ]:
cluster_configs = [
    [
        KMeans,
        [
            {"n_clusters": 200, "n_init": 5, "centroids": False},
            {"n_clusters": 50, "n_init": 5, "centroids": True},
            {"n_clusters": 20, "n_init": 5, "centroids": True},
            {"n_clusters": 5, "n_init": 5, "centroids": True},
        ],
    ],
]

# run clustering generators with centroids
cluster_outputs_c, plot_dicts = run_clustering_generators(
    cluster_configs, embeddings, embeddings_2d=embeddings_2d
)
# [HACK] flip order, should be fixed in run_clustering_generators (should run highest level → lowest level)
for output_dict in cluster_outputs_c:
    for k,v in output_dict["labels"].items():
        output_dict["labels"][k] = v[::-1]
    output_dict["silhouette"] = output_dict["silhouette"][::-1]
In [ ]:
# plot results
fig, axis = plt.subplots(1, 4, figsize=(32, 8), dpi=200)
for idx, cdict in enumerate(cluster_outputs_c):
    if not cdict.get("centroid_params", False):
        axis[idx].scatter(
            embeddings_2d[:, 0],
            embeddings_2d[:, 1],
            c=[e for e in cdict["labels"].values()],
            s=1,
        )
    else:
        axis[idx].scatter(
            cdict["centroid_params"]["n_embeddings_2d"][:, 0],
            cdict["centroid_params"]["n_embeddings_2d"][:, 1],
            c=cdict["model"][idx].labels_,
            s=cdict["centroid_params"]["sizes"],
        )
    print(f"Silhouette score ({idx}): {cdict['silhouette']}")
fig.savefig(PROJECT_DIR / "outputs" / "figures" / "semantic_taxonomy" / "clustering_centroids.png")
Silhouette score (0): [0.032928478]
Silhouette score (1): [-0.0027630487, 0.032928478]
Silhouette score (2): [0.03735191, -0.0027630487, 0.032928478]
Silhouette score (3): [0.083295725, 0.03735191, -0.0027630487, 0.032928478]

Analysis

This section outputs silhouette scores for all relevant outputs above. It also constructs barplots of the cluster sizes for each level of the taxonomy across approaches.
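For reference when reading the scores, the sketch below computes sklearn's silhouette_score on synthetic data: once for two well-separated groups and once for labels assigned at random. The data and thresholds are illustrative assumptions and unrelated to the ARIA embeddings.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
blob_b = rng.normal(loc=10.0, scale=1.0, size=(100, 2))
X = np.vstack([blob_a, blob_b])

true_labels = np.repeat([0, 1], 100)          # one label per blob
random_labels = rng.integers(0, 2, size=200)  # labels unrelated to the data

separated = silhouette_score(X, true_labels)
shuffled = silhouette_score(X, random_labels)
print(f"separated: {separated:.3f}, random labels: {shuffled:.3f}")
```

Scores close to 1 indicate compact, well-separated clusters, while scores near 0 indicate overlapping ones. Most scores reported in this notebook fall in the latter range.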

In [ ]:
# Harmonize cluster outputs for analysis
# [HACK] - fix this. For exports, I create a single dictionary for the fuzzy clusters
cluster_outputs_f = []
for group in ["sklearn.cluster._kmeans", "sklearn.cluster._agglomerative"]:
    dict_group = {
        "labels": defaultdict(list),
        "model": [],
        "silhouette": [],
        "centroid_params": None
    }

    cluster_output = [x for x in cluster_outputs_f_ if x["model"][0].__module__ == group]
    for clust in cluster_output:
        for k, v in clust["labels"].items():
            dict_group["labels"][k].append(v[0])
        model = clust["model"][0]
        dict_group["model"].append(
            "_".join([model.__module__.replace(".", ""), str(model.get_params()["n_clusters"])])
        )
        dict_group["silhouette"].append(clust["silhouette"][0])
    cluster_outputs_f.append(dict_group)
In [ ]:
strict_kmeans_df = make_dataframe(cluster_outputs_s[3], "_strict")
strict_agglom_df = make_dataframe(cluster_outputs_s[7], "_strict")
strict_kmeans_imb_df = make_dataframe(cluster_outputs_simb[-1], "_strict_imbalanced")
fuzzy_kmeans_df = make_dataframe(cluster_outputs_f[0], "_fuzzy")
fuzzy_agglom_df = make_dataframe(cluster_outputs_f[1], "_fuzzy")
dendrogram_df = make_dataframe(cluster_outputs_d, "")
centroid_kmeans_df = make_dataframe(cluster_outputs_c[-1], "_centroids", cumulative=False)
In [ ]:
make_plots(strict_kmeans_df)
In [ ]:
make_plots(strict_agglom_df)
In [ ]:
make_plots(strict_kmeans_imb_df)
In [ ]:
make_plots(fuzzy_kmeans_df)
In [ ]:
make_plots(dendrogram_df)
In [ ]:
make_plots(centroid_kmeans_df)

Silhouette Scores

In [ ]:
results = {
    "kmeans_strict": cluster_outputs_s[3]["silhouette"],
    "agglom_strict": cluster_outputs_s[7]["silhouette"],
    "kmeans_strict_imb": cluster_outputs_simb[-1]["silhouette"],
    "kmeans_fuzzy": cluster_outputs_f[0]["silhouette"],
    "agglom_fuzzy": cluster_outputs_f[1]["silhouette"],
    "agglomerative_dendrogram": cluster_outputs_d["silhouette"],
    "kmeans_centroid": cluster_outputs_c[-1]["silhouette"],
}

results = {"_".join([k, str(idx)]): e for k, v in results.items() for idx, e in enumerate(v)}

silhouette_df = pd.DataFrame(results, index=["silhouette"]).T.sort_values(
    "silhouette", ascending=False
)
display(silhouette_df)
                            silhouette
agglomerative_dendrogram_0    0.103336
kmeans_centroid_0             0.083296
agglom_strict_3               0.041260
agglomerative_dendrogram_1    0.039198
kmeans_centroid_1             0.037352
kmeans_centroid_3             0.032928
kmeans_fuzzy_2                0.019562
kmeans_strict_imb_0           0.019222
kmeans_strict_0               0.019219
kmeans_fuzzy_0                0.019147
kmeans_fuzzy_1                0.019121
kmeans_fuzzy_3                0.013821
kmeans_strict_1               0.012733
kmeans_strict_2               0.011935
kmeans_strict_imb_1           0.007234
agglom_fuzzy_0                0.007031
agglom_fuzzy_1                0.007031
agglom_fuzzy_2                0.007031
agglom_strict_0               0.007031
agglom_strict_2               0.006078
kmeans_strict_imb_2           0.005261
kmeans_strict_imb_3           0.004215
agglomerative_dendrogram_2    0.001632
kmeans_strict_3               0.000472
agglom_strict_1              -0.000069
kmeans_centroid_2            -0.002763
agglom_fuzzy_3               -0.014145
agglomerative_dendrogram_3   -0.020362
agglomerative_dendrogram_5   -0.035250
agglomerative_dendrogram_4   -0.037681

Meta Clustering

Following the approach of Juan in the AFS repository, we combine the outputs of the clustering methods above into a matrix of entity co-occurrences. The objective is to apply community detection algorithms to this matrix.
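A co-occurrence count of this kind can be sketched with NumPy broadcasting: for every pair of tags, count the number of clustering columns in which both receive the same label. The function name cooccurrence_counts and the toy label matrix are assumptions for illustration; make_cooccurrences may be implemented differently.

```python
import numpy as np


def cooccurrence_counts(label_matrix):
    """label_matrix has shape (n_tags, n_clusterings). Returns an
    (n_tags, n_tags) matrix counting the columns in which two tags share a label."""
    label_matrix = np.asarray(label_matrix)
    same = label_matrix[:, None, :] == label_matrix[None, :, :]
    return same.sum(axis=2)


# Four tags described by three clustering columns
labels = np.array([
    [0, 0, 1],  # tag A
    [0, 0, 1],  # tag B
    [0, 1, 0],  # tag C
    [1, 1, 0],  # tag D
])
counts = cooccurrence_counts(labels)
print(counts)
```

The diagonal equals the number of clustering columns, since every tag always shares a label with itself. The broadcasted comparison needs memory proportional to n_tags² × n_clusterings, so this form only suits small inputs; a per-column loop would be preferable for thousands of tags.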

In [ ]:
list_dfs = [
    strict_kmeans_df, 
    strict_agglom_df,
    strict_kmeans_imb_df, 
    fuzzy_kmeans_df, 
    fuzzy_agglom_df,
    dendrogram_df, 
    centroid_kmeans_df
]

meta_cluster_df = (
    pd.concat(list_dfs, axis=1)
    .reset_index()
    .rename(columns={"index": "tag"})
)

Cluster Co-occurrences

In [ ]:
cooccur_dict = make_cooccurrences(meta_cluster_df)
cooccur_df = pd.DataFrame.from_dict(cooccur_dict)
cooccur_df.head(10)
Out[ ]:
tag | Comminution | Chloroquine | Eli Lilly | Cation-exchange capacity | Middle ear | Influenza | Association football | Food allergy | Monophyly | Balaenoptera | ... | Equilibrative nucleoside transporter | Sieve | Quadratically constrained quadratic program | Burgsvik | Molly McGuire | Ranitidine | Ammonia | Project of Translation from Arabic | Ribonuclease | Categorical variable
Comminution | 30 | 2 | 22 | 3 | 3 | 1 | 15 | 1 | 2 | 2 | ... | 1 | 17 | 4 | 22 | 9 | 1 | 4 | 16 | 4 | 4
Chloroquine | 2 | 30 | 2 | 19 | 2 | 1 | 2 | 2 | 9 | 18 | ... | 7 | 2 | 2 | 2 | 1 | 3 | 2 | 2 | 8 | 2
Eli Lilly | 22 | 2 | 30 | 3 | 3 | 1 | 15 | 1 | 2 | 2 | ... | 1 | 18 | 4 | 24 | 12 | 1 | 4 | 14 | 4 | 4
Cation-exchange capacity | 3 | 19 | 3 | 30 | 2 | 1 | 3 | 1 | 9 | 16 | ... | 6 | 3 | 3 | 3 | 2 | 1 | 3 | 3 | 9 | 3
Middle ear | 3 | 2 | 3 | 2 | 30 | 2 | 3 | 6 | 3 | 3 | ... | 1 | 3 | 12 | 3 | 1 | 6 | 12 | 3 | 12 | 12
Influenza | 1 | 1 | 1 | 1 | 2 | 30 | 1 | 11 | 2 | 2 | ... | 12 | 7 | 7 | 1 | 11 | 12 | 9 | 1 | 1 | 7
Association football | 15 | 2 | 15 | 3 | 3 | 1 | 30 | 1 | 2 | 2 | ... | 1 | 11 | 4 | 15 | 8 | 1 | 4 | 20 | 4 | 4
Food allergy | 1 | 2 | 1 | 1 | 6 | 11 | 1 | 30 | 1 | 1 | ... | 12 | 1 | 1 | 1 | 13 | 20 | 1 | 1 | 1 | 1
Monophyly | 2 | 9 | 2 | 9 | 3 | 2 | 2 | 1 | 30 | 15 | ... | 1 | 2 | 2 | 2 | 1 | 1 | 2 | 2 | 2 | 2
Balaenoptera | 2 | 18 | 2 | 16 | 3 | 2 | 2 | 1 | 15 | 30 | ... | 6 | 2 | 2 | 2 | 1 | 1 | 2 | 2 | 8 | 2

10 rows × 8049 columns